Connections to other diagnoses

How Relative Risk is calculated:

N: Total number of patients
p(A) = num(A)/N: probability that a patient has a diagnosis A
p(B) = num(B)/N: probability that a patient has a diagnosis B
observed: number of patients who have both diagnoses A and B
expected = p(A)⋅p(B)⋅N: number of patients who would have both diagnoses if A and B were independent
Relative Risk = observed/expected
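
The calculation above can be sketched in Python. This is only an illustration under assumptions, not the implementation behind this page: the function name relative_risk and the counts in the usage lines are made up for the example.

```python
def relative_risk(num_a: int, num_b: int, observed: int, n: int) -> float:
    """Relative Risk = observed / expected, with expected = p(A) * p(B) * N."""
    if n <= 0 or num_a <= 0 or num_b <= 0:
        raise ValueError("counts must be positive")
    p_a = num_a / n            # probability that a patient has diagnosis A
    p_b = num_b / n            # probability that a patient has diagnosis B
    expected = p_a * p_b * n   # co-occurrences expected if A and B were independent
    return observed / expected


# Hypothetical counts: 1,000 patients, 100 with A, 50 with B, 20 with both.
rr = relative_risk(num_a=100, num_b=50, observed=20, n=1000)
print(round(rr, 2))
```

With these hypothetical counts, 5 patients would be expected to have both diagnoses, so the pair is observed about four times as often as independence would predict. A value near 1 means A and B co-occur about as often as chance alone would suggest.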

Conditional probability:

p(B|A): If a patient has diagnosis A, the probability that they also have diagnosis B. This does not take timing into account: A can occur before or after B.
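
Using the counts defined for Relative Risk, p(B|A) is the number of patients with both diagnoses divided by num(A). A minimal sketch, assuming those counts are available as plain integers; the function name and the numbers are hypothetical.

```python
def conditional_probability(observed: int, num_given: int) -> float:
    """Share of patients with the given diagnosis who also have the other one."""
    if num_given <= 0:
        raise ValueError("no patients with the given diagnosis")
    return observed / num_given


# Hypothetical counts: 100 patients with A, 50 with B, 20 with both.
p_b_given_a = conditional_probability(observed=20, num_given=100)  # p(B|A)
p_a_given_b = conditional_probability(observed=20, num_given=50)   # p(A|B)
print(p_b_given_a, p_a_given_b)
```

The two directions generally differ: in this made-up example p(B|A) is 0.2 while p(A|B) is 0.4, because B is the rarer diagnosis.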

Causality (EXPERIMENTAL):

A⇒B: If a patient has both diagnoses A and B, how often does B come after A?

If A⇒B = 50% and B⇒A = 50%, then A comes after B as often as B comes after A, and the ordering gives no indication of a causal connection.

Example:
A: O80-O84 Delivery
B: O85-O92 Complications predominantly related to the puerperium
A⇒B = 89%
In 89% of cases where a patient has both diagnoses, B comes after A. In reality this should be 100% (?), because the puerperium is defined as the period that follows delivery.

A deficiency of the underlying data is that we don't know for certain when a patient first received a given diagnosis. We only know when it was recorded during the 4-year period. If the real date of first diagnosis is before this date range, the numbers become less accurate and more difficult to interpret. A further complication is that when a patient has many more visits for diagnosis A than for diagnosis B, diagnosis B is more likely to come after A even when there is no causal relationship.

One such case is the following pair:
A: E10-E14 (358,979 patients) Diabetes mellitus
B: O00-O08 (62,554 patients) Pregnancy with abortive outcome
The data shows that if a patient has both diagnoses, then in 76% of cases B comes after A. But patients who have both also have many more visits for A than for B, so it's not clear whether this 76% is the result of a causal connection or pure chance. This aspect of the data could be controlled for to get a more accurate estimate of causality, but the best way to do so is an open question.
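
The effect of unequal visit counts can be illustrated with a small simulation. This is a toy model, not an analysis of the real data: it assumes each visit time is independent and uniformly distributed over the observation window, so there is no causal link between A and B by construction. The function name and the visit counts are hypothetical.

```python
import random


def share_a_first(visits_a: int, visits_b: int,
                  trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of simulated patients whose first A visit precedes their first B visit.

    Assumption: visit times are independent and uniform on [0, 1), i.e. the
    two diagnoses are unrelated and only the number of visits differs.
    """
    rng = random.Random(seed)
    a_first = 0
    for _ in range(trials):
        first_a = min(rng.random() for _ in range(visits_a))
        first_b = min(rng.random() for _ in range(visits_b))
        if first_a < first_b:
            a_first += 1
    return a_first / trials


balanced = share_a_first(visits_a=1, visits_b=1)  # close to 0.5
skewed = share_a_first(visits_a=3, visits_b=1)    # close to 0.75
print(balanced, skewed)
```

In this toy model the share of patients with A first tends to visits_a / (visits_a + visits_b), so three times as many visits for A pushes A⇒B toward 75% without any causal relationship. Whether the real data behaves like this model is not established here.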

Typically either A comes after B or B comes after A, so A⇒B = (100% - B⇒A), but not always. For some pairs of diagnoses it is typical that they are given at the same time; these cases are not included in A⇒B or B⇒A. Here A and B are interpreted as happening at the "same time" if they are given within the same 24 hours. For example, O60-O75 (Complications of labour and delivery) and O80-O84 (Delivery) mostly happen during the same day.
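
The ordering rule can be sketched as follows. Several details are assumptions made for the example rather than facts about the actual computation: that the first recorded date of each diagnosis is what gets compared, that "within the same 24 hours" means a gap of at most 24 hours, and that the percentages are taken over all patients who have both diagnoses (which is why A⇒B and B⇒A need not add up to 100%). The function name and the dates are hypothetical.

```python
from datetime import datetime, timedelta

SAME_TIME = timedelta(hours=24)  # assumed threshold for "same time"


def ordering_shares(pairs):
    """Return (A⇒B, B⇒A, same time) as fractions of patients with both diagnoses.

    pairs: iterable of (first_a, first_b) datetimes, one tuple per patient.
    """
    a_then_b = b_then_a = same = 0
    for first_a, first_b in pairs:
        if abs(first_b - first_a) <= SAME_TIME:
            same += 1        # counted in neither A⇒B nor B⇒A
        elif first_a < first_b:
            a_then_b += 1    # B comes after A
        else:
            b_then_a += 1    # A comes after B
    total = a_then_b + b_then_a + same
    if total == 0:
        raise ValueError("no patients with both diagnoses")
    return a_then_b / total, b_then_a / total, same / total


# Four hypothetical patients with both diagnoses.
patients = [
    (datetime(2020, 1, 1), datetime(2020, 3, 1)),         # A first
    (datetime(2020, 5, 10), datetime(2020, 2, 1)),        # B first
    (datetime(2020, 6, 1, 8), datetime(2020, 6, 1, 20)),  # same day
    (datetime(2020, 7, 1), datetime(2020, 9, 1)),         # A first
]
a_to_b, b_to_a, same_time = ordering_shares(patients)
print(a_to_b, b_to_a, same_time)
```

For these made-up patients A⇒B is 50% and B⇒A is 25%, with the remaining 25% falling into the "same time" group.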

TO DO